Show the code
import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
import json
LetsPlot.setup_html(isolated_frame=True)Course DS 250
Matthew Foushee
import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
import json
LetsPlot.setup_html(isolated_frame=True)
#Location of Json file
json_file = 'flights_missing.json'
#opening and creating Dataframe from Json file
json = json.load(open(json_file))
dataFrame = pd.json_normalize(json)Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
Most of all of the data was there however a handful of the values were either empty, containing no value, or already Nan(Null), This code replaces all empty values with NaN to provide the data with more consistency.
airport_code IAD
airport_name NaN
month January
year 2005.0
num_of_flights_total 12381
num_of_delays_carrier 414
num_of_delays_late_aircraft 1058
num_of_delays_nas 895
num_of_delays_security 4
num_of_delays_weather 61
num_of_delays_total 2430
minutes_delayed_carrier NaN
minutes_delayed_late_aircraft 70919
minutes_delayed_nas 35660.0
minutes_delayed_security 208
minutes_delayed_weather 4497
minutes_delayed_total 134881
Name: 2, dtype: object
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
The Worst airport when it comes to delays is the San Francisco International Airport. They on average have a 36 minute delay on each flight.
# Include and execute your code here
airportsData = {}
#Needs
#Total number of flights
#Total number of delayed flights
#proportion of delayted flights
#average delay time in hours
worstTime = 0
worstAirport = ""
for row in dataFrame.itertuples(index=False):
airportdata = {
"numberOfFlights" : row.num_of_flights_total,
"numberOfDelayedFlights" : row.num_of_delays_total,
"proportionalOnTimeToDelayed" : round(row.num_of_delays_total / row.num_of_flights_total, 3),
"AverageDelayTime" : round((row.minutes_delayed_total/60) / row.num_of_flights_total, 3)
}
if(row.airport_name not in airportsData):
airportsData[row.airport_name] = airportdata
continue
else:
airportsData[row.airport_name]["numberOfFlights"] += airportdata["numberOfFlights"]
airportsData[row.airport_name]["numberOfDelayedFlights"] += airportdata["numberOfDelayedFlights"]
airportsData[row.airport_name]["proportionalOnTimeToDelayed"] += airportdata["proportionalOnTimeToDelayed"]
airportsData[row.airport_name]["AverageDelayTime"] += airportdata["AverageDelayTime"]
aiportdelayaverage = (row.minutes_delayed_total/60) / row.num_of_flights_total
if worstTime < aiportdelayaverage:
worstTime = aiportdelayaverage
worstAirport = row.airport_name
print(airportsData){'Atlanta, GA: Hartsfield-Jackson Atlanta International': {'numberOfFlights': 4235114, 'numberOfDelayedFlights': 870910, 'proportionalOnTimeToDelayed': 25.663, 'AverageDelayTime': 25.554999999999996}, 'Denver, CO: Denver International': {'numberOfFlights': 2323376, 'numberOfDelayedFlights': 439964, 'proportionalOnTimeToDelayed': 23.02600000000001, 'AverageDelayTime': 20.618000000000002}, nan: {'numberOfFlights': 884879, 'numberOfDelayedFlights': 172413, 'proportionalOnTimeToDelayed': 10.349999999999996, 'AverageDelayTime': 10.190000000000005}, "Chicago, IL: Chicago O'Hare International": {'numberOfFlights': 3400032, 'numberOfDelayedFlights': 773122, 'proportionalOnTimeToDelayed': 28.174000000000003, 'AverageDelayTime': 31.59799999999999}, 'San Diego, CA: San Diego International': {'numberOfFlights': 870161, 'numberOfDelayedFlights': 167747, 'proportionalOnTimeToDelayed': 23.964000000000013, 'AverageDelayTime': 18.882000000000005}, 'San Francisco, CA: San Francisco International': {'numberOfFlights': 1565257, 'numberOfDelayedFlights': 408631, 'proportionalOnTimeToDelayed': 33.110000000000014, 'AverageDelayTime': 34.231000000000016}, 'Salt Lake City, UT: Salt Lake City International': {'numberOfFlights': 1293072, 'numberOfDelayedFlights': 190733, 'proportionalOnTimeToDelayed': 17.752000000000002, 'AverageDelayTime': 14.789}, 'Washington, DC: Washington Dulles International': {'numberOfFlights': 773480, 'numberOfDelayedFlights': 152630, 'proportionalOnTimeToDelayed': 23.526000000000007, 'AverageDelayTime': 24.020999999999994}}
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
This chart shows that the best month to fly is in October, it has the lowest delay averages.
# Include and execute your code here
monthData = []
for i in range(12):
monthData.append({"month": "", "delay": 0})
months = {
'january': 1,
'febuary': 2,
'march': 3,
'april': 4,
'may': 5,
'june': 6,
'july': 7,
'august': 8,
'september': 9,
'october': 10,
'november': 11,
'december': 12,
}
#Needs
#Total number of flights
#Total number of delayed flights
#proportion of delayted flights
#average delay time in hours
for row in dataFrame.itertuples(index=False):
if(row.month == 'n/a'):
continue
monthDelayAverage = (row.minutes_delayed_total / 60) / row.num_of_flights_total
monthData[months[row.month.lower()] - 1]["month"] = row.month
monthData[months[row.month.lower()] - 1]["delay"] += monthDelayAverage
monthData[months[row.month.lower()] - 1]["delay"] /= 2
month_df = pd.DataFrame(monthData)
(
ggplot(data=month_df, mapping=aes(x="month", y="delay"))
+ geom_line()
)According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations: A. 100% of delayed flights in the Weather category are due to weather B. 30% of all delayed flights in the Late-Arriving category are due to weather C. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
import json
LetsPlot.setup_html(isolated_frame=True)
json_file = 'flights_missing.json'
months = {
'january': 1,
'febuary': 2,
'march': 3,
'april': 4,
'may': 5,
'june': 6,
'july': 7,
'august': 8,
'september': 9,
'october': 10,
'november': 11,
'december': 12,
}
#opening and creating Dataframe from Json file
json = json.load(open(json_file))
dataFrame = pd.json_normalize(json)
# Include and execute your code here
mean_aircraft = 0
count = 0
for row in dataFrame.itertuples(index=False):
if(row.num_of_delays_late_aircraft < 0):
continue
else:
mean_aircraft += row.num_of_delays_late_aircraft
count += 1
mean_aircraft /= count
mean_aircraft = round(mean_aircraft, 0)
dataFrame.replace(-999, mean_aircraft, inplace=True)
delayedByWeather = []
dataFrame
for row in dataFrame.itertuples(index=False):
appendNumber = row.minutes_delayed_weather
appendNumber += (row.minutes_delayed_late_aircraft * 0.3)
if(row.month == 'n/a'):
appendNumber += (row.minutes_delayed_nas * 0.65)
elif(months[row.month.lower()] >= 4 and months[row.month.lower()] <= 8):
appendNumber += (row.minutes_delayed_nas * 0.4)
else:
appendNumber += (row.minutes_delayed_nas * 0.65)
delayedByWeather.append(round(appendNumber, 0))
dataFrame["minutes_delated_all_weather"] = delayedByWeatherUsing the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
The Chicago O’Hare International airport has the MOST delays by weather and SAN and IAD have by far the least
# Include and execute your code here
newDataFrame = [
{"airport_code": "ATL", "delay_sum": 0, "count": 0},
{"airport_code": "DEN", "delay_sum": 0, "count": 0},
{"airport_code": "IAD", "delay_sum": 0, "count": 0},
{"airport_code": "ORD", "delay_sum": 0, "count": 0},
{"airport_code": "SAN", "delay_sum": 0, "count": 0},
{"airport_code": "SFO", "delay_sum": 0, "count": 0},
{"airport_code": "SLC", "delay_sum": 0, "count": 0}
]
for row in dataFrame.itertuples(index=False):
for item in newDataFrame:
if item["airport_code"] == row.airport_code:
if pd.isna(row.minutes_delated_all_weather):
continue
item["delay_sum"] += row.minutes_delated_all_weather
item["count"] += 1
for item in newDataFrame:
item["delay"] = item["delay_sum"] / item["count"]
new_df = pd.DataFrame(newDataFrame)
(
ggplot(data=new_df, mapping=aes(x="airport_code", y="delay"))
+ geom_bar(stat="identity")
)Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results